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In this paper we propose a Bayesian approach for inference about 
dependence of high throughput gene expression. Our goals are to use 
prior knowledge about pathways to anchor inference about depen- 
dence among genes; to account for this dependence while making 
inferences about differences in mean expression across phenotypes; 
and to explore differences in the dependence itself across phenotypes. 
Useful features of the proposed approach are a model-based parsimo- 
nious representation of expression as an ordinal outcome, a novel 
and flexible representation of prior information on the nature of de- 
pendencies, and the use of a coherent probability model over both 
the structure and strength of the dependencies of interest. We eval- 
uate our approach through simulations and in the analysis of data 
on expression of genes in the Complement and Coagulation Cascade 
pathway in ovarian cancer. 

1. Introduction. Inferring patterns of dependence from high throughput 
geneomic data poses significant challenges. Statistically, the problem is one 
of learning about dependence structures in high dimension, with relatively 
low signal. A promising direction for strengthening this inference is the ex- 
plicit consideration of information from known "pathways" — biochemical 
processes described in terms of a series of relationships among genes and 
their products. 

In this paper we take this perspective, and propose a Bayesian approach 
to achieve three related goals in the context of gene expression analysis: 
to use prior knowledge about pathways to anchor inference about depen- 
dence among genes; to account for this dependence while making inferences 
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about differences in mean expression across phenotypes; and to explore dif- 
ferences in the dependence itself across phenotypes. The proposed model 
builds on the POE model (Parmigiani et al. 2002) and integrates inference 
about probability of differential expression with inference about dependence 
between genes through the formulation of a coherent probability model. 
Our proposed inferences are local in the sense that the model is centered 
around a specific pathway. Formally, variable selection is used to remove and 
add structure relative to the centering pathway. This is in contrast to ap- 
proaches aimed at learning dependence structures de novo from expression 
data, without guidance by a prior pathway structure. 

Some of the existing approaches for probabilistic modeling of dependence 
structures attempt to explore the space of all possible graphical models, often 
restricted to Directed Acyclic Graphs (DAGs) or Bayesian networks (BN) 
(Lauritzen 1996) and decomposable models (Dawid and Lauritzen 1993). 
A comprehensive review of statistical methodology for network data is provid- 
ed in Kolaczyk (2009). Recent literature includes the application of BN and 
dynamic BN to microarray data (Murphy and Mian 1999, Friedman et al. 
2000), with applications and extensions of this methodology reported in 
Ong, Glasner and Page (2002) and Beal et al. (2005), among others. Al- 
though appealing, these techniques have computational and methodological 
limitations related to modeling conditional independence under the "large p, 
small n" paradigm and the difficult specification of consistent prior models 
across dimensions (Dobra et al. 2004). Other authors (Scott and Carvalho 
2008, Jones et al. 2005) have reported difficulties with the performance of 
standard trans-dimensional MCMC methods (Giudici and Green 1999) in 
the exploration of the model space, and suggested alternative stochastic 
search schemes. For a decision theoretic perspective on graphical model se- 
lection see Sebastiani and Ramoni (2005). 

To overcome these problems, we focus on variations of a baseline model 
that represents known dependence structures. The centering anchors the 
model space to a prior path diagram elicited from sets of molecular interac- 
tions derived by previous experiments. 

Our idea is similar to the modeling approaches described in Wei and Li 
(2007) and Wei and Li (2008), who introduced conditional independence 
between genes, via a Markov random field defined over binary hidden states 
of differential expression. These authors propose to consider a fixed Markov 
random field mirroring exactly the topology of a prior pathway and ignor- 
ing the directionality of the edges. The construction of dependence patterns 
based on hidden Markov random fields had also been previously explored by 
Broet and Richardson (2006) in the analysis of CGH microarrays. We con- 
trast the approach of Wei and Li (2007, 2008) in three fundamental ways. 
First, we provide an alternative interpretation of the connections encoded 
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into a prior pathway. We develop a prior model for the dependence struc- 
ture, that is, based on the reciprocal graphs (Koster 1996). This class of 
graphical models takes account of the directionality of the edges included in 
the pathway and allows for the Markovian characterization of cycles, which 
often arise in biological depictions of genetic interactions. Also, recognizing 
that a known pathway is often summarizing results obtained under differ- 
ent experimental conditions, we allow for significant deviations from the 
prior dependence structure. This extension requires explicit consideration of 
a model determination strategy, but enables inference on the model parame- 
ters as well as inference on the dependence structure between genes. Finally, 
our focus is on identifying significant interactions between genes in a prior 
pathway, as opposed to identifying differentially expressed genes in a given 
pathway. 

The proposed methodology finds motivation in the analysis of gene ex- 
pression of Epithelial Ovarian Cancer (EOC) patients. In this setting, the 
complement and coagulation cascade pathway (Figure 2) represents a key 
study target, as disease progression is thought to be highly linked to inflam- 
mation and vascularization processes [Wang, Wang and Kavanagh (2005)]. 

The rest of this article is organized as follows. In Section 2 we introduce 
the proposed model. Section 3 discusses estimation and inferential details 
associated with the proposed model. We validate our approach with a sim- 
ulation study in Section 4. Section 5 employs the model for the analysis 
of epithelial ovarian cancer expression data, to derive inference about ac- 
tive genetic interactions. In the example, a well-known molecular pathway 
provides prior information on the dependence structure. A final section con- 
cludes with a critical discussion of limitations and possible extensions. 

2. Dependent probability of expression. In Section 2.1, we discuss graph- 
ical models and notation, and in Section 2.2, we review the POE (Probabil- 
ity of Expression) model Parmigiani et al. (2002), which defines biological 
events via latent three-way indicators of relevant biological states. The origi- 
nal POE model assumes independence across genes, conditional on hyperpa- 
rameters. We extend the original model by formalizing more complex rela- 
tionships among variables via a cascade of conditional dependences, guided 
by a predefined interaction map. The predefined interaction map is formally 
represented as a graph. In Section 2.3 we introduce a prior probability model 
on this graph. 

2.1. Representing dependence through graphical models. Networks of re- 
lationships among expression levels can be represented as graphs that de- 
scribe how genes influence each other [for an example in ovarian cancer see 
Wang, Wang and Kavanagh (2005)]. More formally, a graph is often charac- 
terized as an algebraic structure Q = {V, E}, composed of a set of nodes V, in 
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our case genes, and a set of edges E C {(v^ ,Vi 2 ),Vi € V}U{{«i 1 , Vi 2 },Vi € V}. 
Here (vi,Vj) denotes a directed edge from i>j to Vj, and {vi,Vj} denotes an 
undirected edge. A graph Q defines the Markov properties of a statistical 
model in a graphical fashion, via the specification of a set of conditional 
dependencies. 

Biochemical networks often include the presence of cycles and feedback 
relationships. This requires some care when trying to characterize a coherent 
probabilistic model that accurately portrays prior biochemical knowledge. 
For this purpose, we focus on a class of graphical models known as re- 
ciprocal graphs [Koster (1996)]. Reciprocal graph are defined as a natural 
generalization of other well-known classes, including directed acyclic graphs 
(DAG) and Markov random fields, among others. Reciprocal graphs are de- 
fined through the coherent probabilistic interpretation of directed a{— > b), 
undirected (a — b) and reciprocal edges (a <^ b). Here, for simplicity, we 
consider a subset of the reciprocal graph family excluding undirected edges. 
The restriction to only directed edges will later be important to facilitate 
the mapping of Q to a simultaneous equation model. 

The proposed model and inference is based on the directed graph Q. But 
sometimes it is of interest to describe the implied conditional independence 
structure, that is, the Markov properties. When desired, the Markov proper- 
ties of our model are defined in terms of an undirected graph Q m = {V, E m } 
elicited via moralization [Koster (1996), Lauritzen (1996)] of a graph Q. 
In substance, the moralization procedure consists in adding an undirected 
edge between parents of a common child and replacing the remaining di- 
rected edges with undirected ones. In Q m , standard Markov field proper- 
ties hold, in the sense that two genes i and j are disconnected when they 
are conditionally independent, given the rest of the genes [Besag (1974)]. 
For example, consider the reciprocal graph Q represented in Figure 1. The 
class of Markov equivalent models spanned by Q may be represented with 
the moral (undirected) graph Q m , for which the following Markov prop- 
erty holds: 1_L2|3,4, that is, P(l, 3|2, 4) = P(l\2, 4)P(3|2, 4). The corre- 
spondence between Q and Q m is not 1-to-l as Q m could arise from the mor- 
alization of an entire class of Markov equivalent reciprocal graphs. Further 
details about moralization in reciprocal graphs and covariance equivalence 
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Fig. 1. (Example) moralization of a reciprocal graph. 
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are discussed in Koster (1996) and Spirtes et al. (1998). Here, our infer- 
ence will be based on Q only, and the directionality will be based on prior 
knowledge. The undirected graph Q m provides a convenient summary of the 
conditional independence structure if desired. 

2.2. Dependent gene expression and hidden systems of simultaneous equa- 
tions. Following Parmigiani et al. (2002), we consider data in the form of 
an (p x n) expression matrix Y, with the generic element yij denoting the 
observed gene expression for gene i in sample j, i = 1, . . . ,p and j = 1, . . . n. 
We introduce latent variables eij E {—1, 0, 1} indexing three possible expres- 
sion categories for each entry in Y. For example, if Y represents ratios of 
expression level relative to a normal reference, they can be interpreted as 
high, normal and low. Given e^, for each gene i and each sample j we assume 
a mixture parameterized with 8 = (ay, /ij, ,K,f) as follows: 

r /_!* = [/(-«-, o), 

(1) PiVij - {oLj + m)\eij) = fij(yij) with <j f 0i = N(0, erf), 

In words, we assume that the observed expressions arise from a mix- 
ture of a Gaussian distribution and two uniform distributions designed to 
capture a broad range of departures relative to the Gaussian. The inter- 
pretation of the Gaussian component varies depending on the experimental 
design and sampling scheme, and can be trained in a supervised way if data 
are available (Garrett and Parmigiani 2004). When the technology used for 
measuring expression has an internal reference, as in Section 5, the high 
(low) class can be interpreted as over- (under) expression compared to the 
reference. The upcoming definition of a dependence structure will focus on 
the latent and define dependence at the level of these indicators. In other 
words, the proposed model could be characterized as a boolean network on 
the latent e^. 

In (1), ctj is a sample-specific effect, included to adjust for systematic 
variation across samples; \i% is a gene-specific effect, modeling the overall 
abundance of each gene, and and nf parameterize the limits of varia- 
tion in the tails. Finally, of is the variance of the normal component of the 
distribution of gene i. We follow Parmigiani et al. (2002) in defining a con- 
ditionally conjugate prior for [a, al and and k\ . Let Qa(a,b) denote 
a Gamma distribution with expectation a/b: 

piVilm^Tfj) = Nfa^Tn), p(l/af\-i a ,\ a ) = ga('~i rj ,\ a ), 

P{ l / K 7hKiK) = Ga(-f-,\-), p(l/re+|7+,A+) = £a(7+,A+); 

where mm(tx,f , k~) > kqGi and kq = 5. The restriction on and nf ensures 
that the gene-specific mixture distribution has heavier tails than its normal 
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component, preserving interpretability of the three-way latent classes. For 
the sample-specific effect ay, we impose an identifiability constraint ay ~ 
N(0,<rZ) with E"=o«t = 0. 

Specifying a prior model for e^, we deviate from Parmigiani et al. (2002), 
defining the model in terms of latent normal variables [Albert and Chib 
(1993)]. For each gene and sample we introduce a latent Gaussian vari- 
able Zij, and define 

{1, if > 1 high expression, 

0, if — 1 < z.ij < 1 normal expression, 

—1, if < — 1 low expression, 

where the distribution of z^j is defined by the following simultaneous equa- 
tions model (SEM): 

(3) = rriij +y^Pik(z k j - m k j) + i = l,...,p,j = l,...,n, 

k^i 

with Eij ^ N(0,sf). Let Zj = (z±j , . . . , z p j)' denote the p-dimensional vec- 
tor of latent probit scores associated with sample j. Also, let B be the 
(p x p) matrix whose diagonal elements are unity and whose off-diagonal 
(i,k) components are —fiik- Provided B is nonsingular, the process above 
defines a proper joint probability density function [Besag (1974)]. More pre- 
cisely, defining the marginal precision matrix H 2 = diag(l/si, . .. ,l/s p ) and 
S7 = B'i^B, we have 

(4) P(Zj\mj,n) = M^expj-^Z^ - m,)'f2(Z, - m,)}, 

where rrij = (my, . . . , m p j)' . 

If ej = (e±j, . . . , e p j)' , the implied probabilities for the indicators are 

(5) P(e j \m j ,Sl)= j ■■■ j P(Zj\mj,n)dZj, 

where Aij is the interval (— oo, —1] if e^ = —1, (—1, 1] if = and (1, oo) 
if = 1. We use notation ixf- =p(% > l|y), ir~j =p(zij < -l\y) and p*j = 

In the context of this SEM, we propose to use a reciprocal graph, Q = 
{V, E}, to describe a dependence structure among the three-way indica- 
tors eij that reflects a priori knowledge about a pathway. Relationships be- 
tween genes are captured via a set of conditional independences over the 
joint distribution of the classes ej = (eij,i = 1, . . . ,p). This is implemented 
by structuring the matrix B so that the off-diagonal element (i, k) is null 
{Pik = 0), if an d only if the edge k — >■ i is not in {E} [Spirtes et al. (1998)]. 
The resulting concentration matrix fi = B T // 2 B will have zero off-diagonal 
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elements (ujik = 0) structured compatibly with the Markov properties en- 
coded in the moral graph Q m = {V, E m } [Koster (1996)]. In summary, we 
use the SEM to define a probability model that matches the conditional 
independence structure given by Q. The coefficients B of the SEM index 
a family of probability models that adhere to a given independence struc- 
ture Q, including an interpretation of the edge directions. 

For each gene and sample, the mean rriij may be modeled as a linear 
function (mjj = x^-frj) of, say, a design vector xj. This allows for comparisons 
across groups. For example, if Xj = 1 and —1 for samples under two different 
biologic conditions, then the posterior distribution for bi formalizes inference 
on the differential expression of gene i under the two conditions, adjusting 
for the dependence among the genes. 

Finally, the autoregressive scheme in (3) implicitly assumes that genetic 
interactions are invariant across all the cross-sample biological variation rep- 
resented in the study. Relaxing this assumption is important and can be 
achieved by including an interaction term relating the covariate or pheno- 
type information in with the neighboring probit scores Zkj in (3). 

In summary, we assume a mixture model for the observed gene expres- 
sions yij . The noisy data is reduced to latent trinary indicators which are 
used to define the dependence structure. Because of the nonlinear shrink- 
age induced by the mixture model, the y^ do not come from a multivariate 
normal, and the patterns of dependence could be more complex. 

2.3. Priors over graphical structures and dependence parameters. We de- 
fine a prior probability model for the dependence structure Q. In words, the 
prior is based on a pathway diagram that summarizes substantive prior 
information about the pathway of interest. We interpret the pathway as 
a reciprocal graph Qq = {V, Eq} (see example in Section 2.1). The prior 
on Q is defined on the set of all graphs that can be obtained by deleting 
edges from Qq . More formally, we define the model space generated by Go as 
M(Qo) = {Q = (V, E) : E C Eq}. If Eq comprises a total number of K edges, 
then M(Qq) includes D = 2 K possible models. 

The definition of the the prior p{Q) can be seen as stating joint probabil- 
ities for the multiple hypothesis testing problem implicitly defined by inclu- 
sion versus exclusion of all possible edges. Following the standard Bayesian 
variable selection scheme [George and McCulloch (1993), Brown, Vannucci 
and Fearn (1998), Dobra et al. (2004)], we can consider edge inclusions as 
exchangeable Bernoulli trials with common inclusion probability (p. If kg 
is the number of edges included in Q, it follows that P(Q\ip) = ip ks (l — 
ip) K ~ kg . When the inclusion probability <p comes from the Beta family 
(ip ~ B(ap,bp)), Scott and Berger (2006) and Carvalho and Scott (2009) 
show that this class of prior model probabilities yield a strong control over 
the number of "false" edges included in Q. The associated marginal prior 
on Q becomes p(Q) = T(ng + a v )r(K + b ip - kq)/T(K + b ip + Kg). 
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A key feature of the proposed prior is the restriction to subsets of Go- 
Inference under the proposed model populates existing pathways with prob- 
abilistic information associated with a biological system at a temporal cross 
section of its dynamic. The restriction to M(Go) is important to keep MCMC 
posterior simulation across the model space practicable. For global searches, 
without restriction to a focused set of models, trans-dimensional MCMC 
becomes impracticable. Local focus does not preclude some extensions be- 
yond M(Go) to facilitate discovery of previously unknown interactions. For 
example, consider an arbitrary graph Q, without restriction to M(Go), and 
let mg denote the number of deleted and added edges relative to Go- One 
could replace kg in the prior by mg and allow for graphs beyond Go- Little 
would change in the proposed inference. But centering on models close to Go 
is important. See also related comments in Section 6. 

Our model is completed defining priors over the nonzero parameters f3ij ~ 
./V(0,cr|) = 1, . . . ,p). This defines a conjugate prior for the normal mod- 
el (3). This formulation is derived as a natural characterization of the SEM 
in (3). 

We recognize that assuming full exchangeability over the edges does not 
make active use of potential prior information on inter-gene relationships, 
possibly available through public data-bases like KEGG or Gene Ontology. 
We note, however, that fine scale prior information on individual interac- 
tions is easily included in the proposed inferential framework defining par- 
tially exchangeable or independent Bernoulli trials with interaction-specific 
inclusion probabilities, say, (fij. If desired, the model can be extended with 
Beta(ajj, Cy) for ip^ , with hyperparameters chosen to reflect interaction sum- 
maries perhaps elicited through available tools like the R package GOSim. 
These elicitation processes are, however, still the subject of active research 
[Frohlich et al. (2007), Mistry and Pavlidis (2008)]. We therefore limit our 
analysis to purely structural priors. 

3. Estimation and inference. 

3.1. Model determination via RJ-MCMC. Let denote all population 
parameters and unknown quantities directly associated with the sampling 
model introduced in (1). We implement posterior inference for (0,B,£/) by 
setting up posterior MCMC simulation. We define the current state x = 
(0,~B,G) as the complete set of unknowns and write 7r(dx) short for the 
target posterior distribution p(6,~B,G\Y). 

The MCMC is defined by the following transition probabilities: (a) Update 
the parameter vector (0,B); and (b) Update G, ensuring that the proposed 
graph Q' is in the set M{Go}- This move usually involves changes to B as 
well. 
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The updates in (a) follow the usual Metropolis-within-Gibbs scheme. We 
sample components of directly from their conditional posterior distribu- 
tions (Gibbs sampling details are reported in the Appendix). We update the 
matrix B by row via multivariate random-walk Metropolis-Hastings transi- 
tion probabilities. Let pa(i) denote the parent nodes of node i in the directed 
graph Q . We define the ith row of B as [3 i and propose a new state fl\ \fl va u\ ~ 

N\ P a(i)\ (Al/3 P a«; Vp,d, where V li = ^l/sjWfWi + l/ap)-\ Here Wj is an 
(n x |pa(i)|) design matrix including all mean adjusted probit scores for par- 
ents of gene i and c is a Metropolis-Hastings tuning parameter. For each 
row, this proposal scheme changes B to B', defining a local approximation 
of a reciprocal graph by a directed acyclical graph. Letting Z denote a p x n 
matrix of mean-adjusted probit scores, the proposed transition is accepted 
with probability 



1; TB~F etr 



--Z T (B' — B) T HJB' — B)Z 



Some care is needed for the updates in (b), as they involve adding or 
deleting an edge in Q, therefore changing the dimension of the parameter 
space. We implement a reversible jumps MCMC (RJ) [Green (1995)]: 

(i) Draw an edge (k — > i) at random from Eq. If in the current state G, 
(k — > i) E, propose the birth of the new edge k — > i. If (k — >■ i) £ E, propose 
the death of k — > i. 

(ii) If we propose the birth k — >■ i, the structural matrix B gets populated 
with a new element (3' ik = u, where u ~ q(u). If we propose the death of edge 
k — > i, we simply set j3' ik = 0. 

Steps (i) and (ii) generate a candidate x' = (B',Q'). Let m = index the move 
proposed in step (i), and let m! index the reverse move. The acceptance 
probability is [Green (1995)] 

(R \ tji t\ ■ L Tr(dcc') q(m'\x') 

(6) R(x,x ) = mm< 1 



ir(dx) q{m\x)q(u) 

where q(m\x) is the probability of proposing move m when the chain is in 
state x, and q(u) is the density function of u. In general, R(x,x') might 
include an additional factor involving the Jacobian of a possible (determin- 
istic) transformation of (x,u) to define x' . The described RJ involves no 
such transformation. The move m is generated in step (i) by a uniform draw 
from Eq, implying q(m\x) = q(m'\x'). Finally, q(u) is the proposal p.d.f. The 
acceptance probability of a birth is then defined as 

f p(x'\Y) . . 
Rb = min< 1; q[u) 



p(x\Y) 



\B'\ n 
min<; 1; — — — etr 

B h 



--Z T (B' - B) T H Z (B' - B)Z 



. (i - <p)q(u) 
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Even though nonsingular matrices define a dense open set in W, if the 
proposed element (3' ik of B' defines a a numerically singular matrix, ft will 
not be positive definite and we reject move m! setting to zero. Given 
this sampling scheme, the probability of a deletion is simply defined as Rd = 
1/Rb, with the roles of x' and x as currently imputed and proposed state 
reversed. 

3.2. Graphical model selection. The posterior probability p((?,B|Y) and 
the corresponding MCMC posterior simulation characterize our knowledge 
about the pathway in the light of the data. Based on this posterior probabil- 
ity, we may be interested in selecting a representative graph Q. The posterior 
only summarizes the evidence for each Q. It does not yet tell us which Qs 
we should finally report. 

This model selection problem has been discussed by different authors. 
Drton and Perlman (2007) discuss graphical model selection from the fre- 
quentist perspective, under the assumption that n > p + 1, while Jones et al. 
(2005) or Meinshausen and Biihlmann (2006) describe selection techniques 
for problems where the sample size n is small when compared to the num- 
ber of variables p. Prom a Bayesian perspective, Carvalho and Scott (2009) 
provide a comprehensive discussion of Objective Bayesian model selection 
in Gaussian Graphical Models. 

In the context of the model described in Section 2, graphical model se- 
lection can be defined by removing elements (&—>•*)€ Eq specified by the 
prior graph Qq = {V, Eq}. This is equivalent to the vanishing of the struc- 
tural parameters in the matrix B, characterizing the joint distribution 
of latent probit scores Z [Ronning and Kukuk (1996)]. If the edge set Eq 
has size \Eq \ = Q, graphical model selection involves testing Q hypothesis 

H°:/3 {q) =0 vs. H l q :(3 {q) ^0 for q = 1, . . . , Q. 

When testing a large number of hypotheses it is important to address possi- 
ble multiplicity problems by controlling some predefined error rate. A pop- 
ular choice is to control the False Discovery Rate (FDR) [Benjamini and 
Hochberg (1995)]. Several authors [Carvalho and Scott (2009), Scott and 
Carvalho (2008), Scott and Berger (2006)] consider the shrinkage prior de- 
fined in Section 3.1 and report how including edges with inclusion probability 
P{Pik 7^ 0) > 0.5 (median model) yields strong control over the number of 
false positives. 

4. Simulation study. We validate and illustrate the proposed method 
with a simulation study with p = 50 genes from n = 30 samples. We de- 
fine Y as the (p x n) matrix of simulated mRNA intensities and consider 
a balanced design where 15 columns of Y are from "normal" samples and 
15 columns of Y are associated with "tumor" samples. Thus, = (1,0)' 
if yij is a normal sample and = (1, 1)' if is a tumor sample. 
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We generate simulated data Y as follows. Given a set of latent scores W ~ 
MJ\f(0, Yi z , It), where ft z = S" 1 encodes a known conditional dependence 
structure, and covariate effects bj ~ iv^m,;, <t 2 / 2 ), we define Zij = Wij + 
x^-bj. We then generate the intensity matrix Y from a three-way mixture 
of Gaussian distributions: 

ViMi <-l~A r (-4,2 2 ), 

(7) y lJ \z ij >3^N(4,2 2 ), 

Vijl-Kzij <3~iV(0,l). 

The precision matrix fl z is defined as follows. First we obtain the pxp matrix 
B by defining 7^ =^ Ga(2, 1), Cij = {—1, 1} with P(cij = 1) = 0.5 and <5o 
a Dirac mass at 0, so that Bjj = 1 (for i = 1, . . . ,p), and the off-diagonal 
elements Bjj = ttq5q + (1 — iro) c ij7ij- The simulation truth is deliberately 
chosen differently from the assumed analysis model (1). 

We then generate tt z by rescaling B'B to a correlation matrix. The sim- 
ulation model (7) is deliberately different from the assumed analysis model, 
but still includes a meaningful notion of true dependence structure and 
strength. 

We use a prior Graph Qq = {V,Eq} spanned by the set of edges E = 
E* U E, with E* spanning the simulation truth of nonzero elements in Bjj 
(in our example \E*\ = 50) and E serving as a random mispecification set 
including false edges (in our example \E\ = 87). 

In Figure 3 we display the classification results for the expression mea- 
surements generated under the dependence schemes just described. We cal- 
culated posterior probabilities of over- and under-expression from 50,000 
posterior samples (thinned by 10), obtained after conservatively discard- 
ing 50,000 iterations. Our C ++ implementation of the algorithm, described 
in Section 3.1, performed this simulation in about 6 hours on a standard 
desktop (2.94 GHz processor). 

Figure 3 (left panel) shows the simulation truths as indicators (eij) of 
over- (white), normal- (grey) and under-expression (black). The right panel 
reports a unidimensional summary of the probabilities of over- or under- 
expression (p*j = kJ- — 7r^). The elements p*j are defined in the [—1, 1] scale 
and may be compared directly with the three-way indicators e g t- We note 
that the p* scale provides improved resolution over genes with signal and 
recovers well the generating truth. 

Posterior inference includes a posterior distribution on the dependence 
structure. In Figure 4 (left panel) we report the number of edges included in 
the model by MCMC iteration, for two chains starting at opposite sides of 
the model saturation spectrum. Despite the size of the mispecification set E, 
the trans-dimensional Markov chains converge fairly rapidly toward models 
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Fig. 2. Complement and coagulation cascades pathway [Wang, Wang and Kavanagh 
(2005)]. 

of size comparable to \E*\ = 50. In the same figure, marginalizing over all 
possible graphs M{Qq}, we report the posterior expected SEM coefficients 
E(($ik\Y) and the edge inclusion probabilities P(fiik 7^ 0|Y) (right panel). In 
this plot, we report the false edges as solid circles. Most solid circles lie in 
the area below an inclusion probability of 0.5. This shows how the adopted 




10 15 20 25 30 5 10 15 20 25 30 5 10 15 20 25 30 

Sample Sample Sample 



Fig. 3. Simulation study: (Left panel) Simulation signal eij. (Central panel) Simulated 
mRNA abundance yij. (Right panel) DepPOE estimate ofp*j. 
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Fig. 4. Simulation study: (Left panel) Number of edges included in the model by MCMC 
iteration, for two chains with starting points at the two extremes of the saturation spec- 
trum. (Right panel) Posterior expected SEM coefficients E(/3ik\Y) vs. posterior inclusion 
probabilities P(/3ik 7^0). False edges are represented with a solid circle. 



probability scheme not only penalizes for model complexity, but effectively 
controls the number of false discoveries, allowing for a genuine recovery of 
the generating conditional dependence structure. 

In our simulation experiments we found that selecting edges with poste- 
rior inclusion probabilities greater than 0.5 tends to control the false dis- 
covery rate at level 0.01. We compared our model to the (independence) 
PoE model of Parmigiani et al. (2002) and found that including network 
inference as a new inferential goal does not diminish the classification ac- 
curacy of under- and over-expressed samples. For details see the Web-based 
supplement [Telesca et al. (2011)]. Furthermore, comparison with standard 
global search algorithms based on dynamic shrinkage of partial correlation 
estimates point to substantial inferential gains associated with the proposed 
methodology (see Web-based supplementary material, Section 3). 

5. Case study. Wang, Wang and Kavanagh (2005) report a study of ep- 
ithelial ovarian cancer (EOC). The goal of the study is to characterize the 
role of the tumor microenvironment in favoring the intra-peritoneal spread 
of EOC. To this end, the investigators collected tissue samples from patients 
with benign (b) and malignant (m) ovarian pathology. Specimens were col- 
lected, among other sites, from peritoneum adjacent to the primary tumor. 
RNA was co-hybridized with reference RNA to a custom made cDNA mi- 
croarray including combination of the Research genetics RG_HsKG_031901 
8k clone set and 9000 clones selected from RG_Hs^eq_ver_070700. A com- 
plete list of genes is available at http : //nciarray .nci .nih.gov/galjfiles/ 
index . shtml, "custom printings." See the array labeled Hs_CCDTM-17.5k- 
lpx. 

In the following discussion we focus on the comparison of 10 peritoneal 
samples from patients with benign ovarian pathology (bPT) versus 14 sam- 
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pies from patients with malignant ovarian pathology (mPT). The raw data 
was processed using BRB Array Tool (http://linus.nci.nih.gov/BRB- 
ArrayTools.html). In particular, spots with minimum intensity less than 
300 in both fluorescence channels were excluded from further analysis. See 
Wang, Wang and Kavanagh (2005) for a detailed description. 

One subset of genes reported on the NIH custom microarray are 61 genes 
in the coagulation and complement pathway from KEGG (http://www. 
genome.ad.jp), shown in Figure 2. Genes on this pathway are of interest 
for their role in the inflammatory process. The arches in the pathway are 
interpreted as prior judgement about (approximate) conditional dependence 
(Section 2.1). However, recognizing that the pathway represents a protein 
system rather than gene expression, we allow for significant deviation from 
this structure, explicitly including model determination in our analysis. 

We fit the model presented in Section 2 to this set of 61 genes. The prior 
set of conditional dependences between genes is represented as a reciprocal 
graph in Figure 2 and includes a set of 148 possible edges. Reported inference 
is based on 50,000 MCMC samples, thinned by 10, after discarding 50,000 
observation for burn-in. 

Recording the number of times the sampler visits a particular edge, we 
calculate the posterior probability = P(/%j.|Y), for each edge (k — > i) in 
the prior graph Qq. In Figure 5 (Panel b) we show the set of selected genetic 
interactions when we consider edges with inclusion probabilities greater than 
0.5 (median model). Edge directionality is inherited from Qq (Figure 2). 

The posterior distribution on e g provides inference on differential expres- 
sion, appropriately adjusted for dependence. Starting from the Complement 
and Coagulation Cascade pathway, we identify a set of 24 genes exhibiting 
patterns of dependence in their differential expression profiles across healthy 
and tumor tissues. 

To interpret our findings, we searched the scientific literature using the In- 
formation Hyperlinked Over Protein (IHOP) tool implemented by Hoffman 
[Hoffman and Valencia (2004)], available at: www.ihop-net.org. For exam- 
ple, our study confirms the centrality of the peptide IL8 (Inteleukin-8) in 
the regulation of the chemokine (CXC and CC motifs) genes. The protein 
encoded by this gene has been reported by several authors to play an im- 
portant role in the response to inflammatory stimuli, resistance to apoptosis 
and tumoral angiogenesis. See Terranova and Rice (1997) or Brat, Bellail 
and Erwin (2005) for comprehensive discussions on IL8 and its receptors. 
One other example is the finding of dependent expression profiles associated 
with the Thrombine pathway (F2 — > F2R and F2 — > THBD). This pathway 
plays a central role in the coagulation cascade and has been reported as 
a potential mediator of cellular function in the ovarian follicle [Roach et al. 
(2002)]. 
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(b) 

Fig. 5. Case study. Panel (a,): Posterior mean degree and associated 95% credible inter- 
vals by gene. Panel (b): Posterior pathway obtained selecting edges with inclusion proba- 
bilities greater than 0.5 (Median model). 



Posterior edge inclusion probabilities allow for the calculation of networks' 
summaries at the gene level, which summarize the role played by individual 
genes in the prior pathway. In Figure 5 (Panel a), we report the poste- 
rior distribution of the degree of the node associated with each gene. This 
quantity is simply defined as di = p(\ne(i)\\Y) , the posterior distribution of 
the number of neighbors associated with each gene. This measure is often 
used in social science as a way to summarize an individual's centrality in 
a relational network [Sabidussi (1966)]. From a molecular biology perspec- 
tive, genes with a high degree may be interpreted as playing active roles in 
the regulation of the pathway under study, in association with the biolog- 
ical process of interest. Our analysis of the degree distribution in Figure 5 
(Panel a) confirms what we observe in the selected posterior set of genetic 
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interactions (Panel b) and identifies important active components of the 
complement and coagulation pathway. For example, we confirm the central 
role of C3-convertase in the promotion and progression of malignant ovarian 
cancer in humans, often reported as a key activation component in mouse 
studies [Markiewski and Lambris (2009)]. 

6. Discussion. We propose a probability model for the analysis of de- 
pendent gene expression data. Dependence between genes is modeled via 
the explicit consideration of prior information from pathways represent- 
ing known biochemical processes. We characterize a biochemical pathway 
as a reciprocal graph depicting a coherent set of conditional dependence 
relationships between three-way classes of gene under-, normal- and over- 
expression. Modeling dependence between latent indicators of class mem- 
bership is likely to represent a more sensible approach for this kind of data, 
when compared with methods that model correlations between observables 
directly. Acknowledging that a known pathway represents only prior infor- 
mation, we seek posterior inference for the model parameters as well as for 
the pathway itself via an RJ-MCMC scheme. We showed, through simula- 
tion studies, that our model enables the recovery of the true dependence 
structure, even under a misspecified prior pathway. 

Our model of mRNA abundance relies on the Probability of Expression 
(POE) Model of Parmigiani et al. (2002), and assumes that the variability of 
expression across tissue samples can be fully characterized by heavy tailed 
mixtures of Normal and Uniform random variables. While this is a simplifica- 
tion of reality, it contributes to denoising data and is likely to provide useful 
summaries, allowing for the investigation of the many aspects associated 
with expression data analysis, from data normalization, to DE analysis, to 
the characterization of molecular profiles. The general framework presented 
in this article is also adaptable to other models of gene expression analysis. 

In the construction of the dependent probability model, it is important 
to acknowledge the limitations of the information provided in a biochemical 
network. In fact, a pathway may not necessarily describe relations among 
transcript levels, although it carries some information about it. The proposed 
methodology is currently restricted to known biochemical pathways. Never- 
theless, structural restrictions to one or more pathways of interest substan- 
tially simplifies computational tractability. The proposed model complexity 
is, in fact, only linear in the number of genes and interactions included 
in the prior graph. In our simulation example this provided substantially 
higher power in the detection of meaningful interactions, when compared to 
standard global search strategies. Computational scalability of the proposed 
methodology could, however, be an issue, when considering highly saturated 
pathways including a large number of genes. In these cases, methods based 
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on simplifying assumptions and approximate inference may indeed prove 
more feasible as exploratory analytical tools [Dobra et al. (2004)]. 

Our model could be extended to discover novel genetic interactions, by 
allowing the adding of new edges between nodes in the prior graph Qq. This 
extension would, however, come at a substantial computational cost and 
would require a challenging reformulation of the prior over graphs p(G), to 
penalize for model complexity and, at the same time, to favor models closer 
to the structure of the prior pathway Qq. Initial progress in this direction 
was reported by Braun, Cope and Parmigiani (2008) and, in the context of 
Bayesian Networks, by Mukherjee and Speed (2008) and it is the subject of 
active research. 

In this article we model dependence between three-way variables as depen- 
dence between latent Gaussian quantities. This probability scheme is only 
a convenient restriction on the possible shapes of dependence characterizing 
a matrix of ordinal random variables. Extensions of our model considering 
a richer class of dependence structures are, in principle, appealing. How- 
ever, these changes would require a higher level of complexity and possible 
ad hock limitations on the clique size contributing to the joint distribution 
of the three-way indicators. 

APPENDIX: FULL CONDITIONAL DISTRIBUTIONS 

Sample-specific means at- Prom Section 2.2 we have that p(aj\r 2 ) oc 
exp{— a|/(2r^)}. Using standard conjugate analysis, it is easy to show that 
«jl Y jA^ ~ N (a*,v* aj )I(l* < a, < u*), where v* aj = {t~ 2 + >], ^ 2 /U = 
0)}- 1 , a* = v* aj YldiVij ~ Vi)/o 2 iI(eij = 0)}, I* = niax:,, : , x A, hj - fa - fc+) 
and u* = min {i:ey= _ 1 }(y ij - fa + fcr). 

Gene-specific means fa. From Section 2.2 we have that p(fa\m^,T 2 ) oc 
exp{— (fa 2 — 2m IJ ,fa)/(2T 2 )}. Using standard conjugate analysis, it is easy to 
show that fj,i\Yi,0\^. ~ N(fa*,v*)I(l* < fa < u*), where v* = {t~ 2 + 
*r 2 £iJ(ey = 0)}" 1 , 1 K = <{m M /r2 + af^iVii ~ a ^3=o)}, I* = 
n ^\.j,,, l}(Vij ~ oij - kf) and u* = min {j:e . - aj + fcr). 

Gene-specific variances a 2 . We introduced a conditionally conjugate In- 
verse Gamma prior for af in Section 2.2. For ease of notation we define 
hi = 1/of , and noi = J2j H e ij = 0)- It is easy to verify that hi\Yi, 6\ hi ~ 
G&mma(a*,b*)I{hi > (« / mm(k~ , kf)) 2 }, where a* = 7 CT + n 0i /2 and b* = 
A <r + =0)(y<j -fa- ocjY/2. 

Uniform bounds k~ and kf . For ease of notation we define z/jo = 1/^ and 
vn = l/kf . We have p{yu) oc vfjf~ e~^ kV B l , (£ = 0, 1). The conditional poste- 
rior distribution of these parameters is defined as Vie\Yi,0\ Vu ~ Gamma(a*£, 
b* £ ) x I(Sa), where a* e = Jk + J2j I ( e ij =2^-1), b* e = \ k and S ie = {v it : v it < 
min[min {i . e . j=2 e-i}(yij -fa- aj), (kq^)" 1 ]}- 
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Probit score precisions 1/sf. For ease of notation we define h s i = 
Taking advantage of conditional conjugacy with the distribution of pro- 
bit scores, we define p(h S i\a s ,b s ) oc h a s l~ l exp{— b s h s i}. Let Zij = Zij — rriij — 
Ylkyti fiik (zkj — m-kj)' ■ It is easy to show that the conditional posterior density 

of h si is then Gamma withp(/i si |Yj, 0\ hs J oc h n s [ 2+as " 1 exp{-h si J2j(^ij)/ 2 }- 

SUPPLEMENTARY MATERIAL 

Convergence diagnostics and model comparisons 

(DOI: 10.1214/11-AOAS525SUPP; .pdf). We provide an extended discus- 
sion of some aspects associated with the proposed model. In particular, we 
compare our results to the PoE model of Parmigiani et al. (2002) as well as 
some current methods used to infer networks. 
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